SOLR-18060: Add Prometheus metrics to CrossDC Consumer. #4063
sigram merged 16 commits into apache:main
mlbiscoc left a comment:
Can you post a sample of all these metrics? Either dump it here or in a txt file? It would be easier to review the names and labels on the metrics.
solr/cross-dc-manager/src/java/org/apache/solr/crossdc/manager/consumer/PrometheusMetrics.java
Counter.builder()
    .name("consumer_input_total")
    .help("Total number of input messages")
    .labelNames("type", "subtype")
I question whether most of these metrics really need a type label. What is its cardinality, and what are the possible combinations? I see in the test that UPDATE is one value. Is there also QUERY or something along those lines?
Yes, there's ADMIN and CONFIGSET.
Hmm, OK. I am not a fan of this label being called type. I think it should carry some context about what it means, since type and subtype are very generic. Is it an operation, or message_type maybe? Then what can subtype be? In core, I made it category, but it is debatable whether we should remove that label/attribute from metrics altogether. If you rename type to something more specific, then maybe you can rename subtype to type. Again, seeing a sample text output of these metrics would help if you can.
It's a request type - there are currently three types: UPDATE, ADMIN and CONFIGSET. Sub-type is primarily for UPDATE (add, dbi, dbq) and ADMIN (path).
Here's a sample output:
# HELP crossdc_consumer_input_total Total number of input messages
# TYPE crossdc_consumer_input_total counter
crossdc_consumer_input_total{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE"} 1.0
# HELP crossdc_consumer_output_total Total number of output requests
# TYPE crossdc_consumer_output_total counter
crossdc_consumer_output_total{otel_scope_name="org.apache.solr",result="handled",type="UPDATE"} 1.0
# HELP crossdc_consumer_output_batch_size Histogram of output batch sizes
# TYPE crossdc_consumer_output_batch_size histogram
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="0.0"} 0
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="5.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="10.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="25.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="50.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="75.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="100.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="250.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="500.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="750.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="1000.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="2500.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="5000.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="7500.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="10000.0"} 1
crossdc_consumer_output_batch_size_bucket{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE",le="+Inf"} 1
crossdc_consumer_output_batch_size_count{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE"} 1
crossdc_consumer_output_batch_size_sum{otel_scope_name="org.apache.solr",subtype="add",type="UPDATE"} 1.0
# HELP crossdc_consumer_output_first_attempt_time_nanoseconds Histogram of first attempt request times
# TYPE crossdc_consumer_output_first_attempt_time_nanoseconds histogram
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="0.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="10000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="25000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="50000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="100000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="250000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="500000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1000000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="2500000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5000000.0"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="2.5E7"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1.0E8"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1.0E9"} 0
crossdc_consumer_output_first_attempt_time_nanoseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="+Inf"} 1
crossdc_consumer_output_first_attempt_time_nanoseconds_count{otel_scope_name="org.apache.solr",type="UPDATE"} 1
crossdc_consumer_output_first_attempt_time_nanoseconds_sum{otel_scope_name="org.apache.solr",type="UPDATE"} 1.7667254470782164E18
# HELP crossdc_consumer_output_time_milliseconds Histogram of output request times
# TYPE crossdc_consumer_output_time_milliseconds histogram
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="0.0"} 0
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5.0"} 0
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="10.0"} 0
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="25.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="50.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="75.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="100.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="250.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="500.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="750.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="1000.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="2500.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="5000.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="7500.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="10000.0"} 1
crossdc_consumer_output_time_milliseconds_bucket{otel_scope_name="org.apache.solr",type="UPDATE",le="+Inf"} 1
crossdc_consumer_output_time_milliseconds_count{otel_scope_name="org.apache.solr",type="UPDATE"} 1
crossdc_consumer_output_time_milliseconds_sum{otel_scope_name="org.apache.solr",type="UPDATE"} 13.0
# TYPE target_info gauge
target_info{service_name="unknown_service:java",telemetry_sdk_language="java",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="1.56.0"} 1
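When reading the histogram lines in the sample above, it helps to remember that Prometheus `le` buckets are cumulative: each bucket counts every observation less than or equal to its bound. A minimal sketch of turning cumulative counts into per-bucket counts follows; the class and method names are hypothetical and only the first few `crossdc_consumer_output_time_milliseconds` buckets from the sample are used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CumulativeBucketsSketch {
  /** Converts cumulative "le" bucket counts into per-bucket counts, preserving order. */
  static Map<String, Long> perBucket(Map<String, Long> cumulative) {
    Map<String, Long> result = new LinkedHashMap<>();
    long previous = 0L;
    for (Map.Entry<String, Long> e : cumulative.entrySet()) {
      // Each cumulative count includes all lower buckets, so subtract the previous one.
      result.put(e.getKey(), e.getValue() - previous);
      previous = e.getValue();
    }
    return result;
  }

  public static void main(String[] args) {
    // Buckets must be supplied in ascending order of their upper bound.
    Map<String, Long> cumulative = new LinkedHashMap<>();
    cumulative.put("5.0", 0L);
    cumulative.put("10.0", 0L);
    cumulative.put("25.0", 1L);
    cumulative.put("50.0", 1L);
    System.out.println(perBucket(cumulative)); // prints {5.0=0, 10.0=0, 25.0=1, 50.0=0}
  }
}
```

The single observation falls in the bucket with upper bound 25, which is consistent with the reported sum of 13.0 milliseconds.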
public static final String ATTR_SUBTYPE = "subtype";
public static final String ATTR_RESULT = "result";

protected final Map<String, Attributes> attributesCache = new ConcurrentHashMap<>();
This is to avoid repeatedly creating millions of small objects (Attributes) when updating metrics.
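The caching idea can be sketched roughly as follows. This is not the PR's implementation: `LabelSet` is a hypothetical stand-in for OpenTelemetry's `Attributes`, and the key format and method names are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for an immutable attribute set such as OpenTelemetry's Attributes. */
final class LabelSet {
  final String type;
  final String subtype;

  LabelSet(String type, String subtype) {
    this.type = type;
    this.subtype = subtype;
  }
}

/** Sketch: reuse one LabelSet instance per distinct label combination. */
class AttributesCacheSketch {
  private final Map<String, LabelSet> attributesCache = new ConcurrentHashMap<>();

  /** Returns the cached instance for (type, subtype), creating it on first use. */
  LabelSet attributesFor(String type, String subtype) {
    // The key format is an assumption; any unambiguous encoding of the label values works.
    String key = type + "/" + subtype;
    return attributesCache.computeIfAbsent(key, k -> new LabelSet(type, subtype));
  }

  int size() {
    return attributesCache.size();
  }

  public static void main(String[] args) {
    AttributesCacheSketch cache = new AttributesCacheSketch();
    LabelSet first = cache.attributesFor("UPDATE", "add");
    LabelSet second = cache.attributesFor("UPDATE", "add");
    System.out.println(first == second); // prints true: the same instance is reused
    System.out.println(cache.size()); // prints 1
  }
}
```

A cache like this only pays off while the number of distinct label combinations stays small and bounded. Building the key still allocates a short-lived string on each call, so a real implementation might choose a cheaper key.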
mlbiscoc left a comment:
Just did another run through. Liking the changes. Just a few more comments.
    log.trace("result=nothandled_shutdown");
  }
- metrics.counter(MetricRegistry.name(type.name(), "nothandled_shutdown")).inc();
+ metrics.incrementOutputCounter(type.name(), "nothandled_shutdown");
Maybe unhandled_shutdown instead of nothandled_shutdown?
protected LongCounter inputMsg;
protected LongCounter inputReq;
protected LongCounter collapsed;
protected LongCounter output;
protected LongHistogram outputBatchSizeHistogram;
protected LongHistogram outputTimeHistogram;
protected LongHistogram outputBackoffHistogram;
protected LongHistogram outputFirstAttemptHistogram;
I had created these Attributed instrument wrappers so you could bind attributes to instruments, which may have worked here and is how all of Solr core does it. I probably should have mentioned that earlier; my fault. Honestly, it's not a blocker and I'm fine with this direction.
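The general idea of binding attributes to an instrument can be sketched as below. The names `LabeledCounter`, `BoundCounter`, and `RecordingCounter` are hypothetical and used only for illustration. This is not Solr's actual wrapper API, just a self-contained sketch of the pattern: the label set is fixed once at construction, so call sites only call `inc()`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a counter instrument that takes labels on every call. */
interface LabeledCounter {
  void add(long value, Map<String, String> labels);
}

/** Sketch of a wrapper that binds a fixed label set to an instrument once. */
final class BoundCounter {
  private final LabeledCounter delegate;
  private final Map<String, String> labels;

  BoundCounter(LabeledCounter delegate, Map<String, String> labels) {
    this.delegate = delegate;
    this.labels = Map.copyOf(labels); // immutable snapshot, reused for every increment
  }

  void inc() {
    delegate.add(1L, labels);
  }
}

class BoundCounterDemo {
  /** In-memory delegate that totals increments per label set. */
  static final class RecordingCounter implements LabeledCounter {
    final Map<Map<String, String>, Long> totals = new HashMap<>();

    @Override
    public void add(long value, Map<String, String> labels) {
      totals.merge(labels, value, Long::sum);
    }
  }

  public static void main(String[] args) {
    RecordingCounter output = new RecordingCounter();
    BoundCounter handled =
        new BoundCounter(output, Map.of("type", "UPDATE", "result", "handled"));
    handled.inc();
    handled.inc();
    // Look up the total recorded for the bound label set.
    System.out.println(output.totals.get(Map.of("type", "UPDATE", "result", "handled"))); // prints 2
  }
}
```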
  if (status != 0) {
-   metrics.counter(MetricRegistry.name(type.name(), "outputErrors")).inc();
+   metrics.incrementOutputCounter(type.name(), "solrError");
solr_error, or maybe just error. It should already be clear that it's in the context of Solr.
solr/cross-dc-manager/build.gradle
//  implementation libs.prometheus.metrics.model
//  implementation(libs.prometheus.metrics.expositionformats, {
//    exclude group: "io.prometheus", module: "prometheus-metrics-shaded-protobuf"
//    exclude group: "io.prometheus", module: "prometheus-metrics-config"
//  })
gradle/libs.versions.toml
prometheus-metrics-core = { module = "io.prometheus:prometheus-metrics-core", version.ref = "prometheus-metrics" }
prometheus-metrics-exporter-servlet-jakarta = { module = "io.prometheus:prometheus-metrics-exporter-servlet-jakarta", version.ref = "prometheus-metrics" }
Do we still need this? Otherwise just remove it.
This PR replaces Dropwizard JSON metrics with Prometheus metrics in the CrossDC Consumer, using the Prometheus client_java API directly. It also removes the Dropwizard dependency.